CHAPTER 3. MAXIMUM LIKELIHOOD AND M-ESTIMATION

3.1 Maximum likelihood estimates - in exponential families. Let $(X, \mathcal{B})$ be a
measurable space and $\{P_\theta,\ \theta \in \Theta\}$ a measurable family of laws on
$(X, \mathcal{B})$, dominated by a $\sigma$-finite measure $v$. Let $f(\theta, x)$ be a jointly
measurable version of the density $(dP_\theta/dv)(x)$, which exists by Theorem 1.3.3. For each
$x \in X$, a maximum likelihood estimate (MLE) of $\theta$ is any $\hat\theta = \hat\theta(x)$
such that $f(\hat\theta, x) = \sup\{f(\phi, x) : \phi \in \Theta\}$. In other words,
$\hat\theta(x)$ is a point at which $f(\cdot, x)$ attains its maximum. In general, the supremum
may not be attained, or it may be attained at more than one point. If it is attained at a unique
point $\hat\theta$, then $\hat\theta$ is called the maximum likelihood estimate of $\theta$. A
measurable function $\hat\theta(\cdot)$ defined on a measurable subset $B$ of $X$ is called a
maximum likelihood estimator if for all $x \in B$, $\hat\theta(x)$ is a maximum likelihood
estimate of $\theta$, and for $v$-almost all $x$ not in $B$, the supremum of $f(\cdot, x)$ is
not attained at any point.

Examples. (i) For each $\theta > 0$ let $P_\theta$ be the uniform distribution on $[0, \theta]$,
with $f(\theta, x) := 1_{[0,\theta]}(x)/\theta$ for all $x$. Then if $X_1, \ldots, X_n$ are
observed, i.i.d. $(P_\theta)$, the MLE of $\theta$ is $X_{(n)} := \max(X_1, \ldots, X_n)$. Note
however that if the density had been defined as $1_{[0,\theta)}(x)/\theta$, its supremum for
given $X_1, \ldots, X_n$ would not be attained at any $\theta$. The MLE of $\theta$ is the
smallest possible value of $\theta$ given the data, so it is not a very reasonable estimate in
some ways. For example, it is not Bayes admissible.
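The behavior of the likelihood in example (i) can be checked numerically. The following is a minimal sketch, assuming a small hypothetical sample; the function name `uniform_likelihood` is illustrative and not from the text. It evaluates the likelihood for the closed-interval version of the density at a few values of the parameter around the largest observation.

```python
def uniform_likelihood(theta, xs):
    """Likelihood of theta for i.i.d. data xs from the uniform law on [0, theta],
    using the density 1_{[0, theta]}(x) / theta (closed interval)."""
    if theta <= 0 or min(xs) < 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))


xs = [0.8, 2.1, 1.4, 0.3]   # hypothetical sample
mle = max(xs)               # the largest observation X_(n)

# The likelihood is 0 below max(xs) and strictly decreasing above it.
grid = [mle + d for d in (-0.5, -0.01, 0.0, 0.01, 0.5, 2.0)]
values = [uniform_likelihood(t, xs) for t in grid]
for t, v in zip(grid, values):
    print(f"theta = {t:5.2f}   likelihood = {v:.6f}")
```

The likelihood jumps from 0 to its largest value at the sample maximum and then decreases, which is why the maximizer is the smallest parameter value compatible with the data.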

(ii) For $P_\theta = N(\theta, 1)^n$ on $\mathbb{R}^n$, with usual densities, the sample mean
$\bar X$ is the MLE of $\theta$. For $N(0, \sigma^2)^n$, $\sigma > 0$, the MLE of $\sigma^2$ is
$\sum_{j=1}^n X_j^2/n$. For $N(m, \sigma^2)^n$, $n \ge 2$, the MLE of $(m, \sigma^2)$ is
$(\bar X, \sum_{j=1}^n (X_j - \bar X)^2/n)$. Here recall that the usual, unbiased estimator of
$\sigma^2$ has $n - 1$ in place of $n$, so that the MLE is biased, although the bias is small:
it equals $-\sigma^2/n$, so it is of order $1/n$ as $n \to \infty$. The MLE of $\sigma^2$ fails
to exist (or equals 0, if 0 were allowed as a value of $\sigma^2$) exactly on the event that all
$X_j$ are equal for $j \le n$, which happens for $n = 1$, but only with probability 0 for
$n \ge 2$. On this event, $f((\bar X, \sigma^2), x) \to +\infty$ as $\sigma \downarrow 0$.
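As a small illustration of example (ii), the sketch below computes the MLE of the mean and variance for a hypothetical sample and compares the variance MLE with the usual unbiased estimator. Because the two differ only in dividing by n or by n - 1, the MLE is exactly (n - 1)/n times the unbiased estimator. The function names and the data are assumptions made for this example.

```python
def normal_mle(xs):
    """Return (sample mean, sum of squared deviations / n) for the data xs."""
    n = len(xs)
    mean = sum(xs) / n
    var_mle = sum((x - mean) ** 2 for x in xs) / n
    return mean, var_mle


def unbiased_variance(xs):
    """Usual unbiased variance estimator, with n - 1 in place of n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


xs = [2.0, 4.0, 4.0, 5.0, 7.0]   # hypothetical sample
n = len(xs)
m_hat, v_hat = normal_mle(xs)
s2 = unbiased_variance(xs)
print(m_hat, v_hat, s2, (n - 1) / n * s2)

# Degenerate case: all observations equal, so the variance "MLE" would be 0,
# which is not an allowed value of the variance parameter.
print(normal_mle([3.0, 3.0, 3.0]))
```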

In general, let $\Theta$ be an open subset of $\mathbb{R}^k$ and suppose $f(\theta, x)$ has first
partial derivatives with respect to $\theta_j$ for $j = 1, \ldots, k$, forming the gradient vector

    $\nabla_\theta f(\theta, x) := \{\partial f(\theta, x)/\partial\theta_j\}_{j=1}^k$.

If the supremum is attained at a point in $\Theta$, then the gradient there will be 0, in other
words the likelihood equations hold,

(3.1.1) $\partial f(\theta, x)/\partial\theta_j = 0$ for $j = 1, \ldots, k$.

If the supremum is not attained on $\Theta$, then it will be approached at a sequence of points
$\theta^{(m)}$ approaching the boundary of $\Theta$, or which may become unbounded if $\Theta$ is
unbounded.

The equations (3.1.1) are called "maximum likelihood equations" in a number of 
statistics books and papers, but that is unfortunate terminology because in general a 
solution of (3.1.1) could also be (a) only a local, not a global maximum of the likelihood, 
(b) a local or global minimum of the likelihood, or (c) a saddle point, as in an example to 
be given below. 
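Possibility (b) can be illustrated with a family that is not taken from the text: assume a Cauchy location family and two hypothetical observations -3 and 3. By symmetry the derivative of the log likelihood vanishes at 0, yet 0 is a local minimum, and the maxima are at plus or minus the square root of 8. This is a sketch under those assumptions only.

```python
import math


def cauchy_loglik(theta, xs):
    """Log likelihood of a Cauchy location family (assumed for illustration)."""
    return -sum(math.log(math.pi * (1.0 + (x - theta) ** 2)) for x in xs)


def score(theta, xs):
    """Derivative of the log likelihood with respect to theta."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


xs = [-3.0, 3.0]                 # hypothetical observations
theta_max = math.sqrt(8.0)       # stationary point with theta^2 = 3^2 - 1

for theta in (-theta_max, -0.5, 0.0, 0.5, theta_max):
    print(f"theta = {theta:7.4f}  score = {score(theta, xs):10.3e}  "
          f"loglik = {cauchy_loglik(theta, xs):.6f}")
```

All three points -sqrt(8), 0 and sqrt(8) solve the likelihood equation here, but only the outer two are maxima.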
For exponential families, it will be shown that MLEs can be found from the likelihood
equations, as follows: 

3.1.2 Theorem. Let $\{P_\theta,\ \theta \in \Theta\}$ be an exponential family of order $k$,
where $\Theta$ is the natural parameter space in a minimal representation (2.5.3). Let $U$ be the
interior of $\Theta$ and $j(\theta) := -\log C(\theta)$. Then for any $n$ and observations
$X_1, \ldots, X_n$ i.i.d. $(P_\theta)$, there is at most one MLE $\hat\theta$ in $U$. The
likelihood equations have the form

(3.1.3) $\partial j(\theta)/\partial\theta_i = \sum_{j=1}^n T_i(X_j)/n$ for $i = 1, \ldots, k$,

and have at most one solution in $U$, which if it exists is the MLE. Conversely, any MLE
in $U$ must be a solution of the likelihood equations. If an MLE exists in $U$ for $v$-almost
all $x$, it is a sufficient statistic for $\theta$.

Proof. Maximizing the likelihood is equivalent to maximizing its logarithm (the log
likelihood), which is

    $-n\,j(\theta) + \sum_{j=1}^n \sum_{i=1}^k \theta_i T_i(X_j)$,

and the gradient of the likelihood is 0 if and only if the gradient of the log likelihood is
0, which evidently gives the equations (3.1.3). Then $K = e^j$ is a smooth function of $\theta$
on $U$ by Theorem 2.5.8, hence so is $j$, and the other summand in the log likelihood is linear
in $\theta$, so the log likelihood is a smooth ($C^\infty$) function of $\theta$. So at a maximum
in $U$, the gradient must be 0, in other words (3.1.3) holds.

A real-valued function $f$ on a convex set in $\mathbb{R}^k$ is called concave if $-f$ is convex
and strictly concave if $-f$ is strictly convex. It is easily seen that a strictly concave
function on a convex open set has at most one local maximum, which then must be a strict absolute
maximum. Adding a linear function preserves (strict) convexity or concavity. So, to show
that the log likelihood is strictly concave on $U$ is equivalent to showing that $j$ is strictly
convex on $U$.

By Corollary 2.5.9, the matrix of second partial derivatives of $j$ is the covariance
matrix of the components of $T$. Since the representation is minimal, the covariance matrix
is non-singular: otherwise there would be a non-zero linear combination of the $T_j$ with
variance 0 and so equal to a constant almost surely. Since a covariance matrix is always
nonnegative definite, in this case it is positive definite. Then, restricted to any line segment
included in $U$, $j$ has a strictly positive second derivative along the segment by the chain
rule, which implies that $j$ is strictly convex. For any strictly concave function $h$ on a
convex open set, if the gradient of $h$ exists and is 0 at a point $\theta$, then $h$ has a
strict local maximum at $\theta$, as can be seen first in the one-dimensional case, then taking
all lines through $\theta$ in general. Then $\theta$ is a strict global maximum of $h$ as desired.

If for almost all $x$, (3.1.3) has a solution $\theta = \hat\theta(x)$ in $U$, which is then
unique, then by (3.1.3), the vector $\{\sum_{j=1}^n T_i(X_j)\}_{i=1}^k$, which is a sufficient
$k$-dimensional statistic as noted in Section 2.5, is a function of $\hat\theta(x)$ which thus
must also be sufficient. □
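To make the likelihood equation (3.1.3) concrete, here is a minimal sketch for one assumed family, the Bernoulli laws in natural parametrization, where T(x) = x, the function j is log(1 + e^theta), and the natural parameter space is the whole line. Newton's method is used as one convenient solver; it is not prescribed by the theorem. The second derivative of j is the variance of T, which is positive, in line with the strict convexity used in the proof. The names `solve_likelihood_equation`, `j_prime` and `j_second` are illustrative.

```python
import math


def j_prime(theta):
    """First derivative of j(theta) = log(1 + e^theta): the mean of T under P_theta."""
    return 1.0 / (1.0 + math.exp(-theta))


def j_second(theta):
    """Second derivative of j: the variance of T under P_theta (strictly positive)."""
    p = j_prime(theta)
    return p * (1.0 - p)


def solve_likelihood_equation(t_bar, theta0=0.0, tol=1e-12, max_iter=100):
    """Solve j'(theta) = t_bar by Newton's method.

    Returns None when t_bar is 0 or 1, in which case the equation has no
    solution and no MLE exists in the (open) natural parameter space."""
    if not 0.0 < t_bar < 1.0:
        return None
    theta = theta0
    for _ in range(max_iter):
        step = (j_prime(theta) - t_bar) / j_second(theta)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton iteration did not converge")


xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]      # hypothetical Bernoulli observations
t_bar = sum(xs) / len(xs)                 # right side of the likelihood equation
theta_hat = solve_likelihood_equation(t_bar)
print(theta_hat, math.log(t_bar / (1.0 - t_bar)))   # Newton solution vs. closed form
```

For this family the solution is available in closed form as the log odds of the sample mean, which gives an independent check on the iteration; when all observations are equal the equation has no solution, matching the "at most one" wording of the theorem.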

Next is an example to show that if a maximum likelihood estimate exists almost surely
but may be on the boundary of the parameter space, it may not be sufficient. Let $v$ be
the law with density $2x^{-3}$ on $[1, \infty)$ and 0 elsewhere. Consider the exponential family
of order 1 having densities $C(\theta)e^{\theta x}$ with respect to $v$, where $C(\theta)$ as
usual is the normalizing constant. Then the natural parameter space is $(-\infty, 0]$. According
to Barndorff-Nielsen (1978, p. 153), if $x > 2$ is observed the MLE is $\hat\theta = 0$. This can
be proved as follows. We have $K(\theta) = \int_1^\infty 2e^{\theta x}x^{-3}\,dx$. Since the
density is nonincreasing in $x$ for any $\theta < 0$ it is enough to show that
$e^{2\theta} < K(\theta)$ for any $\theta < 0$. Some classical special functions are defined
for $t > 0$ and $n = 0, 1, 2, \ldots$ by $E_n(t) := \int_1^\infty e^{-tx}x^{-n}\,dx$ (Gautschi
and Cahill, hereafter G&C, (5.1.4)). Thus $g(t) := K(-t) = 2E_3(t)$ for all $t \ge 0$ and we want
to show this is larger than $e^{-2t}$ for $t > 0$. Both functions equal 1 at $t = 0$.

Case 1: $0 < t \le 1/2$. It will be enough to show that $g'(t) = -2E_2(t) > -2e^{-2t}$, or
equivalently $E_2(t) < e^{-2t}$ for $0 < t \le 1/2$. We have $E_2(t) = e^{-t} - tE_1(t)$
(integration by parts, G&C 5.1.14), so we want to show that $tE_1(t) > e^{-t} - e^{-2t}$. By
G&C, (5.1.20), $E_1(t) > (e^{-t}/2)\log(1 + 2/t)$, so it will be enough to show that

    $h(t) := \frac{t}{2}\log\left(1 + \frac{2}{t}\right) - 1 + e^{-t} > 0$,   $0 < t \le 0.5$.

Note that $e^{-t} > 1 - t + \frac{1}{2}t^2 - \frac{1}{6}t^3$ for all $t > 0$, since both sides
are equal at $t = 0$, then taking derivatives and iterating. Thus it's enough to show that

    $\frac{t}{2}\log\left(1 + \frac{2}{t}\right) - t + \frac{1}{2}t^2 - \frac{1}{6}t^3 > 0$

for $0 < t \le 1/2$, which is equivalent to

    $\frac{t}{2}\left[\log\left(1 + \frac{2}{t}\right) - 2 + t - \frac{1}{3}t^2\right] > 0$, or
    $\log\left(1 + \frac{2}{t}\right) - 2 + t - \frac{1}{3}t^2 > 0$.

Let $f(t) := \log(1 + 2/t)$ and $g(t) := 2 - t + \frac{1}{3}t^2$. We can check that
$f(1/2) = \log 5 = 1.6094 > g(1/2) = 19/12 = 1.5833$. So it suffices to check that
$f'(t) = -2/[t(t + 2)] < g'(t) = \frac{2t}{3} - 1$ for $0 < t \le 1/2$. So we need to check that
$6t < 2t^3 + t^2 + 6$, which holds since $6t < 6$.
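The elementary inequalities of Case 1 can be spot-checked on a grid. The sketch below reads the argument as reducing to three facts on (0, 1/2]: that h(t) = (t/2) log(1 + 2/t) - 1 + e^{-t} is positive, that log(1 + 2/t) exceeds 2 - t + t^2/3, and that 6t < 2t^3 + t^2 + 6. The function names are illustrative, and a grid check is only a sanity test, not a substitute for the proof.

```python
import math


def h(t):
    """h(t) = (t/2) log(1 + 2/t) - 1 + exp(-t)."""
    return 0.5 * t * math.log(1.0 + 2.0 / t) - 1.0 + math.exp(-t)


def f_minus_g(t):
    """log(1 + 2/t) minus the quadratic 2 - t + t^2/3."""
    return math.log(1.0 + 2.0 / t) - (2.0 - t + t * t / 3.0)


def poly_gap(t):
    """2t^3 + t^2 + 6 - 6t, which should be positive."""
    return 2.0 * t ** 3 + t ** 2 + 6.0 - 6.0 * t


grid = [k / 1000.0 for k in range(1, 501)]   # t = 0.001, 0.002, ..., 0.500
min_h = min(h(t) for t in grid)
min_fg = min(f_minus_g(t) for t in grid)
min_poly = min(poly_gap(t) for t in grid)
print(min_h, min_fg, min_poly)
```

On this grid the smallest value of the middle difference occurs at the right endpoint t = 1/2, consistent with the monotonicity argument in the text.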

Case 2: $t \ge 0.33$. On this half-line we use the continued fraction for $E_3(t)$ given by
G&C, (5.1.22), and the fact (noticed by Euler) that for continued fractions with all terms
positive, successive convergents are alternately above and below the value of the continued
fraction. This gives us a lower bound $2E_3(t) > 2(t + 5)e^{-t}/[t^2 + 8t + 12]$ which we want to
prove larger than $e^{-2t}$. Equivalently, we want to prove
$j(t) := 2(t + 5)e^t - t^2 - 8t - 12 > 0$ for $t \ge 0.33$. This can be directly verified for
$t = 0.33$. We have $j'(t) = (2t + 12)e^t - 2t - 8 = 2t(e^t - 1) + 4(3e^t - 2) > 0$ for all
$t > 0$, so $j(t) > 0$ for all $t \ge 0.33$ as desired.
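The direct verification at t = 0.33 and the sign of the derivative can be reproduced in a few lines. In this sketch `j` denotes the Case 2 function 2(t + 5)e^t - t^2 - 8t - 12 (not the function -log C of Theorem 3.1.2), and `lower_bound` is the convergent 2(t + 5)e^{-t}/(t^2 + 8t + 12); both names are chosen for illustration.

```python
import math


def j(t):
    """Case 2 function: 2(t + 5)e^t - t^2 - 8t - 12."""
    return 2.0 * (t + 5.0) * math.exp(t) - t * t - 8.0 * t - 12.0


def j_prime(t):
    """Derivative of j: (2t + 12)e^t - 2t - 8."""
    return (2.0 * t + 12.0) * math.exp(t) - 2.0 * t - 8.0


def lower_bound(t):
    """Continued-fraction convergent used as a lower bound for 2 E_3(t)."""
    return 2.0 * (t + 5.0) * math.exp(-t) / (t * t + 8.0 * t + 12.0)


print("j(0.33) =", j(0.33))
print("j(0)    =", j(0.0))   # negative: the bound is too weak near 0

grid = [0.33 + 0.01 * k for k in range(0, 1000)]   # t from 0.33 to about 10.3
all_j_positive = all(j(t) > 0.0 for t in grid)
all_slope_positive = all(j_prime(t) > 0.0 for t in grid)
bound_beats_target = all(lower_bound(t) > math.exp(-2.0 * t) for t in grid)
print(all_j_positive, all_slope_positive, bound_beats_target)
```

Since j(0) is negative, this lower bound cannot settle small values of t, which is why the separate argument of Case 1 is needed there.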

So, 0 is the MLE for any observation $x > 2$ as stated. But, the identity function
$x$ is a Lehmann-Scheffé sufficient statistic by factorization and Theorem 2.5.10, therefore
minimal sufficient by Theorem 2.3.3, so the maximum likelihood estimator is not sufficient
in this case although it is defined almost surely.
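The loss of information in this example can be seen numerically. The sketch below assumes the family described above, with densities proportional to e^{theta x} with respect to the law with density 2x^{-3} on [1, infinity), and approximates K(theta) by Simpson's rule after the substitution x = 1/u. It then maximizes the log likelihood theta x - log K(theta) over a grid of theta in [-5, 0]. The grid, the quadrature and all function names are choices made for this illustration.

```python
import math


def K(theta, n=2000):
    """K(theta) = integral over [1, inf) of 2 e^{theta x} x^{-3} dx, for theta <= 0.

    With x = 1/u this is the integral over [0, 1] of 2 u e^{theta/u} du,
    approximated here by composite Simpson's rule with n (even) subintervals."""
    def phi(u):
        return 0.0 if u == 0.0 else 2.0 * u * math.exp(theta / u)
    step = 1.0 / n
    total = phi(0.0) + phi(1.0)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * phi(i * step)
    return total * step / 3.0


thetas = [0.0 - 0.05 * k for k in range(101)]          # 0, -0.05, ..., -5
log_K = {theta: math.log(K(theta)) for theta in thetas}


def mle_on_grid(x):
    """Grid maximizer of the log likelihood theta * x - log K(theta)."""
    return max(thetas, key=lambda theta: theta * x - log_K[theta])


for x in (1.5, 2.0, 2.5, 3.0):
    print(f"x = {x}:  grid MLE = {mle_on_grid(x)}")
```

Observations such as 2.5 and 3 give the same estimate 0, so the observation cannot be recovered from the estimate, whereas an observation below 2 gives a strictly negative estimate.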
PROBLEMS



1. Let $X_1, \ldots, X_n$ be i.i.d. Poisson with unknown parameter $\lambda$, $0 < \lambda < \infty$.

(a) As a function of $\lambda$, find the probability that an MLE exists in the parameter space.

(b) When the MLE exists, show that it can be found from the likelihood equation and 
evaluate it. 

2. In the example where for $x > 2$ the MLE is 0, evaluate
$c_0 := \int_{-\infty}^0 K(\theta)\,d\theta$. Using $d\pi(\theta) = K(\theta)\,d\theta/c_0$,
evaluate the Bayes estimator of $\theta$ for squared-error loss.

NOTES 

For the example in which for x > 2 the MLE is 0, M. Manstavicius provided some 
steps in the proof. 